Focuses on insights from the past and answers the question “What happened?”
Focuses on the future and addresses “What might happen next?”
Suggests decision options, answering questions such as “What is the best course of action?” or “What will happen if I do this?”
The expected loss arising from the model not being complex/flexible enough to capture the underlying signal.
High bias means that our model won’t be accurate because it doesn’t have the capacity to capture the signal in the data.
The expected loss arising from the model being too complex and overfitting to a specific instance of the data.
High variance means that our model won’t be accurate because it overfit to the data it was trained on and, thus, won’t generalize well to new, unseen data. To understand and measure variance, we will need training and testing samples.
Clarity of the business issue. The key to achieving clarity is asking the right questions.
Testable hypothesis to guide the project. This will also give management a clearer picture of what can be expected from the project.
An ability to assess the outcome based on clear and measurable KPIs.
The more granular a variable, the fewer observations each level will have. For example, year will have fewer levels but more observations per level than month or day.
We will use SOA Mortality data for this section. All illustrations are based on this dataset.
The techniques used to explore the distribution of a variable depend on its type (numeric vs. categorical). The two types of techniques are summary statistics and data visualization.
Used to find the mean/median and percentiles of a variable.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 33.00 43.00 42.97 54.00 94.00
Variance StdDev IQR
230.93738 15.19662 21.00000
Skewness can be inferred from these statistics: if the mean is greater than the median, the distribution is right-skewed; if the mean is less than the median, it is left-skewed. The example above is nearly symmetric (mean 42.97 vs. median 43.00).
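The summary statistics and the mean-versus-median check can be sketched in a few lines. This is a minimal illustration, not the code that produced the output above: the `ages` values are made up, and the `tolerance` used to call a distribution "roughly symmetric" is an assumption introduced here.

```python
import statistics

def describe(values):
    """Return basic summary statistics for a numeric variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

def skew_direction(summary, tolerance=0.05):
    """Compare mean and median; gaps within tolerance * sd count as symmetric.

    The tolerance is an illustrative assumption, not a standard rule.
    """
    gap = summary["mean"] - summary["median"]
    if abs(gap) <= tolerance * summary["sd"]:
        return "roughly symmetric"
    return "right-skewed" if gap > 0 else "left-skewed"

ages = [21, 30, 33, 38, 43, 44, 49, 54, 60, 58]  # made-up illustrative values
summary = describe(ages)
print(summary)
print(skew_direction(summary))
```

Passing the mean, median and standard deviation reported above (42.97, 43.00, 15.19662) to `skew_direction` gives the same "roughly symmetric" reading.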
Used to show the shape of the distribution of a numeric variable.
Used to determine if outliers exist in a numeric variable.
Note: Both histograms and box plots should show a visual representation of the summary statistics.
Both are used to get a sense of the distribution of a categorical variable. Since categorical variables are unordered, summary statistics won’t work here.
Frequency tables work well only when the variable does not have too many levels. With many levels, bar charts should be used to visualize the distribution more easily.
| prodcat | exposure_cnt | exposure_cnt_p | exposure_face | exposure_face_p | count | count_p |
|---|---|---|---|---|---|---|
| TRM | 19,353 | 4.47% | 3,911,705,808 | 9.32% | 26,788 | 5.36% |
| UL | 87,728 | 20.24% | 15,656,468,107 | 37.31% | 111,932 | 22.39% |
| ULSG | 32,918 | 7.60% | 10,476,583,759 | 24.96% | 49,688 | 9.94% |
| WL | 293,374 | 67.70% | 11,922,500,099 | 28.41% | 311,592 | 62.32% |
There are three combinations of variables to explore: Numeric vs Numeric, Categorical vs Categorical, and Numeric vs Categorical. There are many visualization tools that can be used for bivariate exploration:
Used to look at the distribution of a numeric variable split by a categorical variable.
This is also used to look at the distribution of a numeric variable split by a categorical variable. It is only suitable for categorical variables with a low number of levels. Otherwise, a split box plot should be used.
Used to look at the distribution of a categorical variable split by another categorical variable. When there are too many levels, the chart becomes difficult to interpret.
Used to see the relationship between two variables (whether numeric or categorical). See the examples below:
The scatterplots below are based on subsets of the data.
A variable is one of the many columns in the original dataset. A feature is either an original variable selected to be used in a model or a feature generated from a transformed variable to be used in a model.
A variable could be stock price, while a feature generated from that variable could be change in stock price over some time period or average stock price over some period.
A term-life data set is used here.
The scatter plot below shows no linear relationship between the continuous variables of this dataset.
A log transformation can be applied to skewed (often financial) data to address this non-linearity.
Below is a scatter plot with two of the variables log-transformed. After transformation, the points are more spread out so that patterns, if any, can be identified.
Note: When modeling with transformed variables (as in this case), it is important to remember to transform the resulting predictions back to un-transformed numbers.
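The transform and back-transform round trip can be illustrated as follows. This is a minimal sketch: the `face_amounts` values are hypothetical, and the natural log is assumed, which requires strictly positive inputs.

```python
import math

def log_transform(values):
    """Natural-log transform; every value must be strictly positive."""
    return [math.log(v) for v in values]

def back_transform(log_values):
    """Undo the log transform so results are on the original scale."""
    return [math.exp(v) for v in log_values]

face_amounts = [50_000, 100_000, 250_000, 1_000_000]  # hypothetical values
logged = log_transform(face_amounts)
restored = back_transform(logged)

print([round(v, 3) for v in logged])
print([round(v) for v in restored])
```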
PCA is a dimensionality reduction technique that transforms a large set of correlated variables into a smaller set of uncorrelated features called principal components, while retaining as much of the original variation in the data as possible. It can only be applied on numeric data. Categorical variables have to be converted beforehand.
Principal components are linear combinations of the original variables.
We will look at data based on diamonds. As mentioned, PCA can only be done on numerical variables. Let’s look at the numerical variables in the diamonds data.
| carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|
| 0.23 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 0.21 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
| 0.23 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
| 0.29 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
| 0.31 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
We will use carat, depth, x, y and z to conduct PCA.
The two main outputs from PCA are the summary and the loadings.
In the summary, focus mainly on the proportion of variance explained by each PC and, secondarily, on the standard deviation of each PC.
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 1.9890 1.0051 0.17862 0.03540 0.02591
Proportion of Variance 0.7912 0.2020 0.00638 0.00025 0.00013
Cumulative Proportion 0.7912 0.9932 0.99962 0.99987 1.00000
The loadings (reported as the rotation matrix) show how each variable contributes to the PCs. The sum of squares of the loadings for each PC should equal 1.
PC1 PC2 PC3 PC4 PC5
carat -0.49668722 0.005903109 -0.86783797 0.009262406 -0.006199172
depth -0.01409718 0.994550858 0.01548542 -0.007080449 -0.101881938
x -0.50129255 -0.050645176 0.28094608 -0.747021349 -0.330407707
y -0.50111275 -0.055294983 0.29683507 0.660069453 -0.471196063
z -0.50069438 0.072189161 0.28209166 0.078303871 0.811410290
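Both outputs can be checked by hand. The sketch below reuses the standard deviations and the first two loading columns printed above (the variable names are introduced here for illustration). It recomputes each proportion of variance as the squared standard deviation divided by the total, and confirms that the squared loadings of a PC sum to 1.

```python
from itertools import accumulate

# Values copied from the PCA output shown above.
std_devs = [1.9890, 1.0051, 0.17862, 0.03540, 0.02591]
loadings = {
    "PC1": [-0.49668722, -0.01409718, -0.50129255, -0.50111275, -0.50069438],
    "PC2": [0.005903109, 0.994550858, -0.050645176, -0.055294983, 0.072189161],
}

# Variance of each PC is its standard deviation squared.
variances = [s ** 2 for s in std_devs]
total_variance = sum(variances)

proportions = [v / total_variance for v in variances]
cumulative = list(accumulate(proportions))

# Each loading vector has unit length, so its squared entries sum to 1.
sum_sq = {pc: sum(w ** 2 for w in column) for pc, column in loadings.items()}

print([round(p, 4) for p in proportions])
print([round(c, 4) for c in cumulative])
print({pc: round(s, 6) for pc, s in sum_sq.items()})
```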
PCA can be visualized using biplots.
K-Means is an unsupervised learning algorithm that partitions data into K distinct, non-overlapping clusters. K must be specified upfront; there is no universal default, so it is typically chosen with a diagnostic such as the elbow method. Variables should be standardized prior to clustering. The algorithm is designed for continuous variables.
Randomly place K centroids in the data space. We can select K by using the elbow method (creating an elbow plot).
Assign each observation to the nearest centroid based on Euclidean distance.
Recalculate each centroid as the mean of all observations assigned to that cluster.
Keep alternating between steps 2 and 3 until the centroids stabilize.
The algorithm minimizes the “within-cluster sum of squares”, making each cluster as tight and homogeneous as possible.
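The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the starting centroids are K randomly chosen observations (one common way to "randomly place" them), the data are assumed to be already standardized, and the function names are introduced here.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: start from k randomly chosen observations (an assumption).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each observation to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once assignments (and so centroids) no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, assignments

def within_cluster_ss(points, centroids, assignments):
    """Total squared distance from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # toy data
centroids, assignments = kmeans(points, k=2)
print(assignments, round(within_cluster_ss(points, centroids, assignments), 4))
```

Running `kmeans` for several values of `k` and plotting `within_cluster_ss` against `k` gives the elbow plot mentioned above.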
An unsupervised learning method that builds a hierarchy of clusters in a tree-like structure (dendrogram) that shows how the observations are grouped together at different levels of similarity.
Exploratory Analysis: When you don’t know how many groupings may exist in your data, the dendrogram shows all possibilities.
Nested Segmentation.
There are two methods:
The distance here is based on linkage method for measuring distance between clusters which will be discussed below.
Recalculate distance between the new cluster and all other clusters.
Decide where to cut the dendrogram to get the final k clusters.
The method of calculating distance between clusters.
Distance between closest points in two clusters.
Distance between farthest points.
Average distance between all pairs of points.
Minimizes within-cluster variance. Similar to K-means.
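The four linkage methods differ only in how the distance between two clusters is computed. The sketch below assumes Euclidean distance between points and uses function names introduced here. For the variance-minimizing (Ward) method it returns the increase in within-cluster sum of squares caused by a merge, which is the quantity that method seeks to keep small.

```python
import math
from itertools import product

def single_linkage(a, b):
    """Distance between the closest pair of points."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the farthest pair of points."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(cluster):
    return tuple(sum(d) / len(cluster) for d in zip(*cluster))

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    n_a, n_b = len(a), len(b)
    return n_a * n_b / (n_a + n_b) * math.dist(centroid(a), centroid(b)) ** 2

cluster_a = [(0, 0), (0, 2)]  # toy clusters
cluster_b = [(3, 0), (3, 2)]
print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
print(ward_increase(cluster_a, cluster_b))
```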
If the target variable is a count, check dispersion:
- Equidispersion (variance = mean): use Poisson.
- Overdispersion (variance > mean): use negative binomial.
If target variable is positive continuous:
Binary (0/1, Yes/No)
Proportion/Probability (between 0 and 1)
Count (0, 1, 2, 3, …)
Continuous - Any Value (-∞ to +∞)
Continuous - Positive Only (>0)
Calculate the variance-to-mean ratio from your data.
Ratio ≈ 1 (between 0.8 and 1.2)
Ratio > 1.2
Excess Zeros (>60-70% zeros AND poor model fit)
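The dispersion check can be sketched as below. The 0.8 and 1.2 cut-offs come from the rule of thumb above; the use of the sample variance, the function names and the toy claim counts are assumptions made for illustration. The excess-zeros case is not handled here because it also depends on model fit.

```python
import statistics

def variance_to_mean_ratio(counts):
    """Sample variance divided by the mean of a count variable."""
    return statistics.variance(counts) / statistics.mean(counts)

def suggest_count_distribution(counts, low=0.8, high=1.2):
    """Apply the rule of thumb: ratio near 1 -> Poisson, above 1.2 -> NB."""
    ratio = variance_to_mean_ratio(counts)
    if ratio > high:
        return "negative binomial"
    if ratio >= low:
        return "Poisson"
    # Ratios below the band are not covered by the rule of thumb.
    return "underdispersed (outside the rule of thumb)"

equidispersed = [0, 1, 1, 2, 0, 3, 1, 0]  # toy claim counts
overdispersed = [0, 0, 0, 0, 0, 8, 0, 0]

print(variance_to_mean_ratio(equidispersed), suggest_count_distribution(equidispersed))
print(variance_to_mean_ratio(overdispersed), suggest_count_distribution(overdispersed))
```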
Data Contains Zeros (e.g., many policies with $0 claims)
Distribution: Tweedie (compound Poisson-Gamma, with 1 < p < 2)
Data is Strictly Positive (No Zeros)
Check the shape of the distribution:
Symmetric around the mean
Right-Skewed (long tail to the right)
Choosing between Gamma and Inverse Gaussian:
Log Link (Poisson, Negative Binomial, Gamma, Tweedie)
Logit Link (Binomial)
Identity Link (Normal)
| Target Variable Type | Distribution | Link | Coefficient Interpretation |
|---|---|---|---|
| Binary (Yes/No) | Binomial | Logit | exp(β) = Odds Ratio |
| Proportion (0-1) | Binomial | Logit | exp(β) = Odds Ratio |
| Count (variance ≈ mean) | Poisson | Log | exp(β) = Multiplicative |
| Count (variance > mean) | Negative Binomial | Log | exp(β) = Multiplicative |
| Count (excess zeros) | ZIP/ZINB | Log | exp(β) = Multiplicative |
| Continuous (any value) | Normal | Identity | β = Additive |
| Continuous (positive, symmetric) | Normal | Identity | β = Additive |
| Continuous (positive, skewed) | Gamma | Log | exp(β) = Multiplicative |
| Continuous (positive, with zeros) | Tweedie | Log | exp(β) = Multiplicative |
Deviance is a measure of goodness of fit for a GLM (similar to the sum of squares). The baseline is the null deviance, which is the deviance when the target is predicted using only the sample mean (similar to the Total Sum of Squares).
Residual vs fitted plots check the homogeneity of the variance and the linearity of the relationship.
An offset is a term whose coefficient is already known (fixed at 1) and so does not need to be estimated. Offsets handle known differences in exposure or scale across observations. When the response variable represents counts or totals accumulated over different time periods, geographical areas or population sizes, they cannot be compared directly without accounting for these differences, hence the use of offsets.
Example: If modelling claim counts and one policyholder had coverage for 0.5 years while another had coverage for 1.5 years, comparing their raw claim counts would be misleading. The offset adjusts for this.
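As a sketch of how the offset enters the model, assume a count model with a log link, where \(Y_i\) is the claim count, \(t_i\) is the exposure in years and \(x_{i1}, \dots, x_{ip}\) are the predictors (this notation is introduced here for illustration):

\[
\log E[Y_i] = \log(t_i) + \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
\quad\Longleftrightarrow\quad
E[Y_i] = t_i \, \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})
\]

The term \(\log(t_i)\) has no coefficient to estimate. With identical predictors, the policy with 1.5 years of coverage is expected to have three times the claims of the policy with 0.5 years, so the coefficients describe the claim rate per year of exposure.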
Prior weights are used when data is aggregated into a single record. A prior weight tells the model to treat that record as representing x observations.
Effect on Model Fitting Prior weights affect the deviance and degrees of freedom:
Disaggregated data (3 rows)
| Age | Claims | Exposure |
|---|---|---|
| 25 | 1 | 1.0 |
| 25 | 0 | 1.0 |
| 25 | 2 | 1.0 |
Aggregated data (1 row)
| Age | Claims | Exposure | Weight |
|---|---|---|---|
| 25 | 3 | 3.0 | 3 |
The difference between these is their roles in the model structure.
Predictors should be standardized before regularization.
Hyperparameters:
- Lambda (\(\lambda\)): penalty parameter
- Alpha (\(\alpha\)): proportion between ridge and lasso regression; used in elastic net.
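To show how the two hyperparameters interact, the sketch below computes an elastic net penalty. It assumes the glmnet-style convention, in which \(\alpha = 1\) gives the lasso and \(\alpha = 0\) gives ridge with a factor of one half on the squared term. Other software may scale the terms differently, and the function name is introduced here.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Penalty added to the loss, excluding the intercept from coefs.

    Assumed convention: lam * (alpha * L1 + (1 - alpha) / 2 * L2 squared).
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)

coefs = [1.0, -2.0]  # hypothetical slope coefficients
print(elastic_net_penalty(coefs, lam=0.5, alpha=1.0))  # pure lasso
print(elastic_net_penalty(coefs, lam=0.5, alpha=0.0))  # pure ridge
print(elastic_net_penalty(coefs, lam=0.5, alpha=0.5))  # equal mix
```

A larger `lam` shrinks coefficients more strongly, while `alpha` decides how much of that shrinkage can set coefficients exactly to zero.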
Below is an example of a basic decision tree. In each node:
- The top label is the predicted class
- The middle number is the predicted probability of the majority class (in this case M)
- The bottom number is the percentage of observations in the node
Impurity measures how “mixed” the classes are within a node. A pure node contains observations from only one class. The goal is to reduce impurity at each split.
Three main impurity metrics are used:
\(Gini(N) = 1 - \sum_{i=1}^c p_i^2\)
\(Entropy(N) = - \sum_{i=1}^c p_i \log_2 (p_i)\)
Information gain
\(\text{Classification Error}(N) = 1 - \max_{i=1,\dots,c} p_i\)
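The impurity formulas translate directly into code. The sketch below takes the class proportions of a node as a list. The `information_gain` helper assumes the usual definition (parent entropy minus the size-weighted entropy of the child nodes), and all function names are introduced here for illustration.

```python
import math

def gini(p):
    return 1 - sum(x * x for x in p)

def entropy(p):
    # Classes with zero proportion contribute nothing to the sum.
    return -sum(x * math.log2(x) for x in p if x > 0)

def classification_error(p):
    return 1 - max(p)

def proportions(counts):
    total = sum(counts)
    return [c / total for c in counts]

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(
        sum(child) / n * entropy(proportions(child)) for child in children_counts
    )
    return entropy(proportions(parent_counts)) - weighted

mixed, pure = [0.5, 0.5], [1.0, 0.0]
print(gini(mixed), entropy(mixed), classification_error(mixed))
print(gini(pure), entropy(pure), classification_error(pure))
print(information_gain([5, 5], [[5, 0], [0, 5]]))
```

A perfectly mixed two-class node gives the maximum of each measure, and a pure node gives 0 for all three.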
This controls the minimum number of observations that must exist in a node to split said node. The evaluation is done prior to the split.
This controls the minimum number of observations that must exist in the new leaf node after a split is done. The evaluation is done after the split and so if the evaluation is not passed, then the split is reversed.
This controls the minimum impurity reduction required for a split to be made. If Cp is .01 and the error reduction is less than .01, then the split won’t be made.
Below is an example of a Cp table. We look for the lowest xerror, which here is 0.2081 (CP = 0.0000, 8 splits).
| CP | nsplit | rel error | xerror | xstd |
|---|---|---|---|---|
| 0.7919 | 0 | 1.0000 | 1.0000 | 0.0648 |
| 0.0604 | 1 | 0.2081 | 0.3087 | 0.0428 |
| 0.0268 | 2 | 0.1477 | 0.2550 | 0.0394 |
| 0.0201 | 4 | 0.0940 | 0.2685 | 0.0403 |
| 0.0134 | 6 | 0.0537 | 0.2282 | 0.0374 |
| 0.0067 | 7 | 0.0403 | 0.2215 | 0.0369 |
| 0.0000 | 8 | 0.0336 | 0.2081 | 0.0359 |
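Picking the row with the lowest cross-validated error can be sketched as follows. The rows are copied from the table above; the variable names are introduced here for illustration.

```python
# Columns copied from the Cp table: CP, nsplit, rel error, xerror, xstd.
cp_table = [
    {"CP": 0.7919, "nsplit": 0, "rel_error": 1.0000, "xerror": 1.0000, "xstd": 0.0648},
    {"CP": 0.0604, "nsplit": 1, "rel_error": 0.2081, "xerror": 0.3087, "xstd": 0.0428},
    {"CP": 0.0268, "nsplit": 2, "rel_error": 0.1477, "xerror": 0.2550, "xstd": 0.0394},
    {"CP": 0.0201, "nsplit": 4, "rel_error": 0.0940, "xerror": 0.2685, "xstd": 0.0403},
    {"CP": 0.0134, "nsplit": 6, "rel_error": 0.0537, "xerror": 0.2282, "xstd": 0.0374},
    {"CP": 0.0067, "nsplit": 7, "rel_error": 0.0403, "xerror": 0.2215, "xstd": 0.0369},
    {"CP": 0.0000, "nsplit": 8, "rel_error": 0.0336, "xerror": 0.2081, "xstd": 0.0359},
]

# The optimal row minimizes the cross-validated error (xerror).
best_row = min(cp_table, key=lambda row: row["xerror"])
print(best_row["CP"], best_row["nsplit"], best_row["xerror"])
```

Note that rel error always falls as splits are added, while xerror does not (it rises between 2 and 4 splits here), which is why xerror is the column used to choose Cp.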
This controls how many levels of nodes are allowed in a tree. The root node is counted as depth 0, its child nodes as depth 1, and so on.
After determining the optimal Cp, we can build a new tree from scratch using that Cp value. One downside of this method, however, is that a good split may come after a bad split, and the model will never reach it because the bad split did not pass the Cp requirement.
One way to avoid this is by building a fully complex tree then pruning backwards. This is called cost complexity pruning.
Cost complexity pruning begins with a full tree (using a Cp value of 0), then removes the least important splits according to the optimal Cp value.
A regression tree is grown similarly to a classification tree. Instead of an impurity measure like Gini, splits are chosen to minimize the residual sum of squares (RSS).
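Choosing a split by RSS can be sketched for a single numeric predictor. This is a simplified illustration, not a full tree algorithm: candidate thresholds are assumed to be midpoints between consecutive distinct values, and the function names and toy data are introduced here.

```python
def rss(values):
    """Residual sum of squares around the mean of a node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total RSS) for the split that minimizes RSS."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

x = [1, 2, 3, 10, 11, 12]    # toy predictor
y = [5, 6, 5, 20, 21, 19]    # toy target
threshold, total_rss = best_split(x, y)
print(threshold, round(total_rss, 4), round(rss(y), 4))
```

Each resulting leaf predicts the mean of its observations, so the split that minimizes the summed RSS is the one that makes the two leaves most homogeneous.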
Variable importance shows the ordering of variables according to their contribution to the model.
| Variable | Importance |
|---|---|
| concave.points_mean | 134.5302 |
| perimeter_worst | 113.6826 |
| radius_worst | 113.3666 |
| concave.points_worst | 112.9353 |
| concavity_mean | 109.4282 |
| concavity_worst | 90.3203 |
| area_worst | 25.7681 |
| texture_worst | 17.4773 |
| area_mean | 14.3656 |
| perimeter_mean | 14.3656 |
For classification, the simplest form of model assessment is using a confusion matrix.
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | TP | FP |
| Predicted Negative | FN | TN |
| | High | Low |
|---|---|---|
| High | 128 | 9 |
| Low | 346 | 466 |
From the confusion matrix we can derive several performance measures: accuracy, precision, sensitivity/recall.
The proportion of all predictions (both positive and negative) that were correct.
\(accuracy = \frac{TP + TN}{N}\)
The proportion of positive predictions that were actually positive. When positive is predicted, how often is it right?
\(precision = \frac{TP}{TP+FP}\)
The proportion of actual positives that were classed correctly. Out of all actual positives, how many were caught?
\(sensitivity = \frac{TP}{TP+FN}\)
The proportion of actual negatives that were classed correctly.
\(specificity = \frac{TN}{TN+FP}\)
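The four measures can be computed from the counts in the High/Low confusion matrix above. The sketch assumes that rows are predictions and columns are actual classes (matching the layout of the TP/FP/FN/TN table) and that "High" is the positive class; if the orientation is the other way round, FP and FN swap.

```python
def classification_metrics(tp, fp, fn, tn):
    """Performance measures derived from confusion matrix counts."""
    n = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Assumed mapping of the example table: predicted High row = (TP, FP),
# predicted Low row = (FN, TN).
metrics = classification_metrics(tp=128, fp=9, fn=346, tn=466)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under that assumption the model is precise when it predicts High but misses most actual High cases, which accuracy alone would not reveal.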
Another form of model assessment for classification is the ROC curve. This curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) across cut-off values.
A cut-off value is used as the threshold for predicting the positive class. For example, if the cut-off is 0.8 and a node has 0.75 Yes and 0.25 No, the node is still classed as No even though the majority class is Yes.
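The effect of the cut-off, and how one point on the ROC curve is obtained, can be sketched as follows. The predicted probabilities and actual labels are made up for illustration, and the function names are introduced here.

```python
def classify(prob_yes, cutoff):
    """Predict Yes only when the predicted probability reaches the cut-off."""
    return "Yes" if prob_yes >= cutoff else "No"

def roc_point(probs, actuals, cutoff):
    """Return (false positive rate, true positive rate) at one cut-off."""
    tp = sum(1 for p, a in zip(probs, actuals) if p >= cutoff and a == 1)
    fp = sum(1 for p, a in zip(probs, actuals) if p >= cutoff and a == 0)
    fn = sum(1 for p, a in zip(probs, actuals) if p < cutoff and a == 1)
    tn = sum(1 for p, a in zip(probs, actuals) if p < cutoff and a == 0)
    return fp / (fp + tn), tp / (tp + fn)

print(classify(0.75, cutoff=0.8))  # majority class is Yes, but below cut-off
print(classify(0.75, cutoff=0.5))

probs = [0.9, 0.8, 0.6, 0.4, 0.2]  # made-up predicted probabilities of Yes
actuals = [1, 1, 0, 1, 0]          # made-up labels, 1 = Yes
for cutoff in (0.0, 0.5, 1.1):
    print(cutoff, roc_point(probs, actuals, cutoff))
```

Sweeping the cut-off from high to low traces the curve from (0, 0) to (1, 1).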
Ensembles many decision trees built on bootstrapped samples and random subsets of predictors. This reduces variance by averaging many decorrelated trees, while keeping bias similar to that of a single deep tree.
Random Forest Algorithm

1. Training
   a. Draw a random sample of observations (with replacement) from the training data.
   b. At each split, choose the splitting variable from a random sample of features (drawn without replacement).
   c. Train the decision tree on the above.
   d. Repeat.
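The two sources of randomness in training can be sketched as below. This shows only the sampling, not tree fitting; the function names, the seed and the sizes used are illustrative assumptions.

```python
import random

def bootstrap_sample(n_obs, sample_size, rng):
    """Row indices drawn WITH replacement, so rows can repeat."""
    return [rng.randrange(n_obs) for _ in range(sample_size)]

def split_candidates(n_features, n_candidates, rng):
    """Feature indices drawn WITHOUT replacement; redrawn at every split."""
    return rng.sample(range(n_features), n_candidates)

rng = random.Random(42)
rows = bootstrap_sample(n_obs=10, sample_size=10, rng=rng)
out_of_bag = sorted(set(range(10)) - set(rows))  # rows this tree never sees
features = split_candidates(n_features=9, n_candidates=3, rng=rng)

print("bootstrap rows:", rows)
print("out-of-bag rows:", out_of_bag)
print("features considered at this split:", features)
```

Because rows are drawn with replacement, each tree typically leaves some rows out, which is why enough trees are needed for every observation to be used.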
The number of trees to grow. The more trees the better, especially for datasets with a large number of observations or predictors.
The proportion of observations in each random sample used to build each individual tree. Each observation should be used at least once, so if the proportion is small, the number of trees needs to be larger.
The proportion of features to be used at each split. This parameter is usually tuned as part of the model fitting process.
Builds trees sequentially, where each new tree attempts to correct the errors of the previous ones. This reduces bias, but can increase variance, making tuning/regularization important.
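The sequential error-correcting idea can be sketched with one-split trees (stumps) fitted to residuals. This assumes gradient boosting with squared-error loss on a single numeric predictor, which is one common variant; the notes above do not specify the variant. The function names, the `learning_rate` value and the toy data are introduced here.

```python
def fit_stump(x, residuals):
    """Find the one-split tree that best fits the current residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the remaining errors."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, p in zip(x, predictions)
        ]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)
        history.append(mse)
    return base, stumps, history

x = [1, 2, 3, 4, 5, 6, 7, 8]                   # toy predictor
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]   # toy target
base, stumps, history = boost(x, y)
print(round(history[0], 4), round(history[-1], 4))
```

The training error keeps falling as trees are added, which is the bias reduction described above; the same property is what makes a small learning rate and a limit on the number of trees important for avoiding overfitting.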